Gradient descent

Machine learning, Neural network

An optimization method to find a local minimum. Let’s start with the simplest, one-dimensional case. Assume a differentiable function $f(x)$. If we take the derivative at a certain point $x_0$, the value of the derivative at $x_0$, $f'(x_0)$, will be positive if $f$ is increasing with $x$ and vice versa. The magnitude of the derivative will be larger if the slope is steeper. If we want to move towards the minimum, we can then move in the direction opposite to the one the derivative points to ($-f'(x_0)$).
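As an illustrative example: with $f(x) = x^2$, the derivative is $f'(x) = 2x$. At $x_0 = 3$ we get $f'(3) = 6 > 0$, so stepping in the direction $-f'(3)$ (decreasing $x$) moves towards the minimum at $x = 0$.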

If we increase the dimension, the same logic applies. Assume a differentiable function $f(\mathbf{x})$ with $\mathbf{x} \in \mathbb{R}^n$. Then $-\nabla f(\mathbf{x})$ points in the direction of steepest descent, because a larger partial derivative means a steeper slope along that dimension. You can also think about each dimension independently.

Gradient descent calculates, given $f$ and a starting point $\mathbf{x}_0$, the next point $\mathbf{x}_{n+1} = \mathbf{x}_n - \gamma \nabla f(\mathbf{x}_n)$ (where $\gamma$ is the learning rate) and repeats this process to arrive at a local minimum. Note that if the learning rate is too large, it can overshoot, whereas a tiny learning rate can make the process very slow.
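A minimal sketch of this update loop in Python (the quadratic example function, the learning rate, and the fixed step count are illustrative assumptions, not from the text above):

```python
import numpy as np

def gradient_descent(grad_f, x0, learning_rate=0.1, steps=100):
    """Repeatedly step in the direction of the negative gradient."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - learning_rate * grad_f(x)  # x_{n+1} = x_n - gamma * grad f(x_n)
    return x

# Example: f(x, y) = x^2 + 3y^2, whose gradient is (2x, 6y).
grad_f = lambda x: np.array([2 * x[0], 6 * x[1]])
print(gradient_descent(grad_f, x0=[5.0, -3.0]))  # converges towards (0, 0)
```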

In a Machine learning model, $f$ can be an error function that captures the prediction error given a set of parameters $\theta$.

When there are lots of data points, the calculation of the gradient can take a long time. A nice method to overcome this challenge is Stochastic gradient descent, which estimates the gradient by using small random batches. By adding noise, it also gains a self-regularizing property and becomes more robust.
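A minimal sketch of mini-batch stochastic gradient descent for a least-squares line fit (the synthetic data, batch size, learning rate, and epoch count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2x + 1 plus noise.
X = rng.uniform(-1, 1, size=1000)
y = 2 * X + 1 + 0.1 * rng.standard_normal(1000)

w, b = 0.0, 0.0
learning_rate, batch_size = 0.1, 32

for epoch in range(50):
    order = rng.permutation(len(X))          # shuffle, then walk through mini-batches
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        err = (w * xb + b) - yb              # prediction error on the mini-batch
        w -= learning_rate * 2 * np.mean(err * xb)  # gradient of MSE w.r.t. w
        b -= learning_rate * 2 * np.mean(err)       # gradient of MSE w.r.t. b

print(w, b)  # should end up close to (2, 1)
```

Each mini-batch gives a noisy but cheap estimate of the full gradient, which is where both the speedup and the regularizing noise come from.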

Lectures